This post is for my friends at ICPSR (I think there's one or two of you there who will still admit to knowing me).
I came across some notes that I had made in the twilight of my career as a data processor. For those of you who do not know what ICPSR does, it is (or claims to be, anyway) the world's largest archive of quantitative social science data. Investigators send their materials to Ann Arbor for preservation and distribution. The staff at ICPSR are charged with preserving the original materials (data collection instruments, interviewing guides, methodological memoranda, electronic datasets, etc.) as well as packaging this material so someone else can profitably use it to pursue a research question. This involves converting the data into a format that can be used by any statistics package; preparing documentation so the analyst will understand how to use the data; and cataloging the materials so would-be analysts can find them. For example, Brad Wright is doing some interesting work on the effects of religion on social behavior. He is able to do this because the investigators of the original studies made their data available through archives. Brad found these datasets through various channels and is now putting them to good use. Felicia LeClere explains the virtue of sharing data more eloquently than I ever could.
Data processing at ICPSR is an incredibly tedious yet important job. Each and every measurement is checked to ensure that the ranges are correct and that labels are applied. When I started there, the only way to do this was to generate a frequency distribution for each measure and check it. Datasets were corrected on an item-by-item basis. This is an incredibly inefficient and error-prone way to work. It occurred to me after a while that the appropriate use of metadata would both make this task more efficient and enhance the value of the data. In my time there, I'm not sure I convinced anyone that I was right. So, I'm taking another stab now.
What's Metadata?
Briefly, metadata captures information about the information. Let's say that I'm conducting a survey of child welfare professionals (because I am). My survey will have 40 measurements (not really, but this is for the sake of simplicity). 10 of these measurements will be nominal questions that capture basic identifying information (male/female, job title, church affiliation, institution that granted one's degree, etc.). 20 of these measurements will be ordinal, capturing information that has a discernible hierarchy (high to low), but the increments within this hierarchy are not mathematically precise. For instance, let's say that my sample includes divisional managers, team managers, lineworkers, and support staff. We know that a divisional manager has more authority than a team manager, but we can't say in any precise numerical way how much more. The remaining 10 measurements will be on an interval/ratio scale. These are measures like age, or number of years on the job, where a value of 20 is equal to twenty units, and 20 units is twice 10.
These three categories of variables are appropriate for different types of analysis. Interval/ratio measures are numerical variables. I can use these to calculate means, analyze variance, and estimate regression equations. As an analyst, when I examine an interval-ratio variable in a codebook, I want to know four things: mean, standard deviation, median, and interquartile range. Rarely do I want to see a frequency distribution on these measures, because that information is more efficiently communicated in these other descriptive statistics. In contrast, ordinal and nominal variables are qualitative. They either arbitrarily label a condition with no implicit quantitative value (Catholic = 1, Protestant = 2, Jewish = 3, etc.) or they communicate imprecise variation (low, medium, high). In either case, means, standard deviations, medians, and interquartile ranges have limited value. A frequency table is much more useful. As an analyst, I want to know the meaning of each value for these measures. I may decide to collapse these items into binary variables for some form of regression analysis. Or I may want to use categorical variables in a contingency table.
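That collapse-into-binary step is mechanical once the codes are known. Here is a minimal Python sketch, purely for illustration (the religion codes mirror the example above; the function name and label mapping are my own inventions), of recoding an arbitrarily coded nominal variable into 0/1 dummy indicators suitable for regression:

```python
# Hypothetical code-to-label mapping for a nominal religion variable.
# The numeric codes are arbitrary labels, so arithmetic on them is
# meaningless; 0/1 dummies restore a quantity a regression can use.
RELIGION_LABELS = {1: "Catholic", 2: "Protestant", 3: "Jewish"}

def to_dummies(values):
    """Turn coded nominal values into one 0/1 indicator list per category."""
    return {
        label: [1 if v == code else 0 for v in values]
        for code, label in RELIGION_LABELS.items()
    }
```

So a column coded `[1, 3, 2, 1]` becomes three binary columns, one per religion, each marking where that category occurs.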
So What?
Now that I'm using this stuff, I'm finding that social science codebooks could be far more useful than they are. If data processors were to classify the level of measurement of each variable (e.g., nominal, ordinal, interval, or ratio), I could use that information when searching for data. For instance, if I'm teaching data analysis and am introducing my class to regression, I might want to use real data on an interesting topic. But for the sake of conceptual clarity, I only want to use interval-ratio measures in my example (to minimize the straining of assumptions).
Tagging data in this manner would also make life easier for the data processor. [Well, on one level it creates work... the cognitive work of deciding how to classify each variable's level of measurement; but this is useful work. Scrolling through hundreds of pages of frequency output for cardinal measures is the opposite of useful.] Once the data are processed in this way, processors can spend their time examining the frequency tables of categorical variables, to ensure the data are accurately labeled. They can glance at the five-number summary on interval-ratio measures to make sure the values are within range. Once the variables are tagged, generating this output is relatively easy to script.
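To make that last point concrete, here is a minimal sketch in Python of what such a script might look like. The variable names and the METADATA dictionary are hypothetical, invented for illustration; the point is simply that once each variable carries a level-of-measurement tag, the right check output falls out automatically.

```python
import statistics
from collections import Counter

# Hypothetical metadata: each variable tagged with its level of measurement.
METADATA = {
    "job_title": "nominal",
    "authority_rank": "ordinal",
    "years_on_job": "ratio",
}

def describe(values, level):
    """Produce the check output appropriate to the variable's level."""
    if level in ("nominal", "ordinal"):
        # Categorical: a frequency table is what the processor needs
        # in order to verify that every code is labeled.
        return {"type": "frequency", "table": dict(Counter(values))}
    # Interval/ratio: a five-number summary plus the mean is enough to
    # spot out-of-range values without pages of frequency output.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "type": "summary",
        "min": min(values), "q1": q1, "median": median,
        "q3": q3, "max": max(values),
        "mean": statistics.mean(values),
    }
```

A driver loop over `METADATA` would then emit a frequency table for `job_title` and `authority_rank`, and a compact summary for `years_on_job` — exactly the split between categorical and cardinal checking described above.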
Why did I write this? Well, I came across these notes, which I had spent some time on two years ago, and figured they might benefit someone. Or not.
Saturday, May 19, 2007
Friday, May 18, 2007
Mary Douglas (1921 - 2007)
From Rex at Savage Minds I learned that Mary Douglas passed away. Every sociologist who wishes to study the dynamics of institutions should read How Institutions Think.
Professor Douglas will be missed.
Thursday, May 17, 2007
An amazing reflection on mother
I cried while reading this. The Phil Nugent Experience is now daily reading.
Thanks to Scott McLemee at Crooked Timber for linking to this wonderful writer.